5420 Anomaly Detection | Assignment 2 - Joyce Ng (jn2901)¶

Credit Card Fraud Detection Modeling¶

The Office of Management and Enterprise Services in the State of Oklahoma has made its purchase credit card transactions available. This dataset contains information on purchases made through the purchase card programs administered by the state and higher education institutions. This analysis will utilized Histogram Based Outlier Score and Empirical Cumulative Distribution-based Outlier Detection to identify anomalies.

Table of Contents¶

  • Section 1: Data Preparation
    • 1.1 Load Libraries and Dataset
    • 1.2 Check Missing Values & Change Columns Names and Data Type
    • 1.3 Create Data Fields
  • Section 2: EDA
    • 2.1 Distribution of Variables
  • Section 3: Feature Engineering
    • 3.1 Feature Benchmarks
    • 3.2 Feature Creation
    • 3.3 Descriptive Statistics of Features
    • 3.3 Distribution of Features
  • Section 4: Histogram-Based Outlier Score (HBOS)
  • Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD)
  • Section 6: Models Predictions Comparison
  • Section 7: Conclusion

 

Section 1: Data Preparation ¶

1.1 Load Libraries and Dataset ¶

In [1]:
# Load libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.utils.data import generate_data
from pyod.models.hbos import HBOS
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.models.ecod import ECOD

# Visiualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns

import warnings 
warnings.filterwarnings("ignore")
In [2]:
df = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/2 - Credit Card/purchase_credit_card.csv')
df.head()
Out[2]:
Year-Month Agency Number Agency Name Cardholder Last Name Cardholder First Initial Description Amount Vendor Transaction Date Posted Date Merchant Category Code (MCC)
0 201307 1000 OKLAHOMA STATE UNIVERSITY Mason C GENERAL PURCHASE 890.00 NACAS 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
1 201307 1000 OKLAHOMA STATE UNIVERSITY Mason C ROOM CHARGES 368.96 SHERATON HOTEL 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM SHERATON
2 201307 1000 OKLAHOMA STATE UNIVERSITY Massey J GENERAL PURCHASE 165.82 SEARS.COM 9300 07/29/2013 12:00:00 AM 07/31/2013 12:00:00 AM DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE...
3 201307 1000 OKLAHOMA STATE UNIVERSITY Massey T GENERAL PURCHASE 96.39 WAL-MART #0137 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM GROCERY STORES,AND SUPERMARKETS
4 201307 1000 OKLAHOMA STATE UNIVERSITY Mauro-Herrera M HAMMERMILL COPY PLUS COPY EA 125.96 STAPLES DIRECT 07/30/2013 12:00:00 AM 07/31/2013 12:00:00 AM STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT...
In [3]:
# Check dimensions of the dataset, we found it has 442,458 rows and 11 columns 
print(df.shape, "\n") # Check the shape of df
print(df.columns, "\n") # Check column names
print(df.info(), "\n") # Check info of the df
df.describe() # Get the Simple Summary Statistics
(442458, 11) 

Index(['Year-Month', 'Agency Number', 'Agency Name', 'Cardholder Last Name',
       'Cardholder First Initial', 'Description', 'Amount', 'Vendor',
       'Transaction Date', 'Posted Date', 'Merchant Category Code (MCC)'],
      dtype='object') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Year-Month                    442458 non-null  int64  
 1   Agency Number                 442458 non-null  int64  
 2   Agency Name                   442458 non-null  object 
 3   Cardholder Last Name          442458 non-null  object 
 4   Cardholder First Initial      442458 non-null  object 
 5   Description                   442458 non-null  object 
 6   Amount                        442458 non-null  float64
 7   Vendor                        442458 non-null  object 
 8   Transaction Date              442458 non-null  object 
 9   Posted Date                   442458 non-null  object 
 10  Merchant Category Code (MCC)  442458 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 37.1+ MB
None 

Out[3]:
Year-Month Agency Number Amount
count 442458.000000 442458.000000 4.424580e+05
mean 201357.284375 42785.860353 4.249912e+02
std 47.107417 33378.461293 5.266509e+03
min 201307.000000 1000.000000 -4.286304e+04
25% 201309.000000 1000.000000 3.091000e+01
50% 201401.000000 47700.000000 1.048900e+02
75% 201404.000000 76000.000000 3.450000e+02
max 201406.000000 98000.000000 1.903858e+06

1.2: Check Missing Values & Change Columns Names and Data Type ¶

In [4]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values
Out[4]:
Year-Month                      0
Agency Number                   0
Agency Name                     0
Cardholder Last Name            0
Cardholder First Initial        0
Description                     0
Amount                          0
Vendor                          0
Transaction Date                0
Posted Date                     0
Merchant Category Code (MCC)    0
dtype: int64

1.3: Create Data Fields ¶

In [5]:
# Change column names 
df.columns = ['Year_Month', 'Agency_Number', 'Agency_Name', 'Cardholder_Last_Name',
      'Cardholder_First_Initial', 'Description', 'Amount', 'Vendor', 'Transaction_Date',
      'Posted_Date', 'Merchant_Category']
In [6]:
# Creating separate Year and Month columns for future feature engineering
# Converting Transaction_Date and Posted_date from Python Object data type to datetime
df['Transaction_Date']=pd.to_datetime(df['Transaction_Date'])
df['Posted_Date']=pd.to_datetime(df['Posted_Date'])

df['Year_Month'].dtype # Check data type for information purpose
df['Year_Month']=pd.to_datetime(df['Year_Month'], format='%Y%m')
df['Year'] = df['Year_Month'].dt.year
df['Month'] = df['Year_Month'].dt.month

#Creating 'Week_Number' from 'Transaction_Date'
df['Week_Number'] = df['Transaction_Date'].dt.isocalendar().week

#Creating 'Day_of_Week' from 'Transaction_Date'
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['Day_of_Week'] = df['Transaction_Date'].dt.dayofweek.map(dayOfWeek)

df['Day_of_Week'].value_counts()
Out[6]:
Day_of_Week
Wednesday    89644
Thursday     87537
Tuesday      86919
Friday       79917
Monday       61809
Saturday     26307
Sunday       10325
Name: count, dtype: int64
In [7]:
df.head()
Out[7]:
Year_Month Agency_Number Agency_Name Cardholder_Last_Name Cardholder_First_Initial Description Amount Vendor Transaction_Date Posted_Date Merchant_Category Year Month Week_Number Day_of_Week
0 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Mason C GENERAL PURCHASE 890.00 NACAS 2013-07-30 2013-07-31 CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS 2013 7 31 Tuesday
1 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Mason C ROOM CHARGES 368.96 SHERATON HOTEL 2013-07-30 2013-07-31 SHERATON 2013 7 31 Tuesday
2 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Massey J GENERAL PURCHASE 165.82 SEARS.COM 9300 2013-07-29 2013-07-31 DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... 2013 7 31 Monday
3 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Massey T GENERAL PURCHASE 96.39 WAL-MART #0137 2013-07-30 2013-07-31 GROCERY STORES,AND SUPERMARKETS 2013 7 31 Tuesday
4 2013-07-01 1000 OKLAHOMA STATE UNIVERSITY Mauro-Herrera M HAMMERMILL COPY PLUS COPY EA 125.96 STAPLES DIRECT 2013-07-30 2013-07-31 STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... 2013 7 31 Tuesday
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Year_Month                442458 non-null  datetime64[ns]
 1   Agency_Number             442458 non-null  int64         
 2   Agency_Name               442458 non-null  object        
 3   Cardholder_Last_Name      442458 non-null  object        
 4   Cardholder_First_Initial  442458 non-null  object        
 5   Description               442458 non-null  object        
 6   Amount                    442458 non-null  float64       
 7   Vendor                    442458 non-null  object        
 8   Transaction_Date          442458 non-null  datetime64[ns]
 9   Posted_Date               442458 non-null  datetime64[ns]
 10  Merchant_Category         442458 non-null  object        
 11  Year                      442458 non-null  int32         
 12  Month                     442458 non-null  int32         
 13  Week_Number               442458 non-null  UInt32        
 14  Day_of_Week               442458 non-null  object        
dtypes: UInt32(1), datetime64[ns](3), float64(1), int32(2), int64(1), object(7)
memory usage: 46.0+ MB

Section 2: EDA ¶

2.1: Distribution of Variables ¶

In [9]:
# Summarize the count statistics by agency_name in df_count
df_count = df['Agency_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Agency_Name','Count']

fig = px.bar(df_count, x='Agency_Name', y='Count', color = 'Agency_Name', width=1000, height=400)

sns.set_theme(style="whitegrid")
sns.barplot(y="Agency_Name", x="Count", data=df_count[1:20])
plt.title('Distribution of Agency', fontweight='bold')
plt.xlabel('Count')
plt.ylabel('Agency Name')
plt.show()
No description has been provided for this image

Distribution of Transaction Amount¶

In [10]:
# Set the style of the seaborn plot
sns.set_style("whitegrid")

# Create a histogram for transaction amounts with a log scale
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=df['Amount'], bins=100, kde=True, log_scale=True)

# Format x-axis tick labels to add comma formatting
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: '${:,.0f}'.format(x) if x >= 1 else '${:.2f}'.format(x)))

plt.title('Distribution of Transaction Amounts', fontweight='bold')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.savefig('Distribution of Transaction Amount.png', transparent=True)

# Display the plot
plt.show()
No description has been provided for this image

Interaction of Transaction Amounts by Agency and Merchant Category¶

In [11]:
# Calculate the total transaction amount for each Agency and Merchant Category
df_grouped = df.groupby(['Agency_Name', 'Merchant_Category'])['Amount'].sum().reset_index()

# Take top 40 agencies for a clearer visualization
df_grouped = df_grouped.sort_values(by='Amount', ascending=False)
df_grouped = df_grouped.head(40)
# Sort DataFrame by Amount
df_grouped = df_grouped.sort_values(by='Amount', ascending=True)

# Create the scatter plot
fig = px.scatter(df_grouped, 
                 x="Agency_Name", 
                 y="Merchant_Category", 
                 color="Amount", 
                 size="Amount", 
                 title='Interaction of Transaction Amounts by Agency and Merchant Category',
                 width=1000, height=600,
                 color_continuous_scale='Viridis_r'
                )

# Customize layout to edit legend title and set background to white
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Amount ($)"  # Legend title
    ),
    title={
        'text': 'Interaction of Transaction Amounts by Agency and Merchant Category',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    paper_bgcolor="rgba(0,0,0,0)",  # Set the background color of the paper (outer) to transparent
    plot_bgcolor='white'  # Set the background color of the plotting area (inner) to white
)

# Customize x-axis
fig.update_xaxes(
    title = 'Agency Name',
    mirror=True,
    ticks='outside',
    showline=True,
    linecolor='lightgrey',
    gridcolor='lightgrey',
    showgrid=False,
    tickangle=-30
)

# Customize y-axis
fig.update_yaxes(
    title = 'Merchant Category',
    mirror=True,
    ticks='outside',
    showline=True,
    linecolor='lightgrey',
    gridcolor='lightgrey',
    showgrid=False
)

fig.write_image("Interactions of two or three variables.png")

# Show the plot
fig.show()

Section 3: Feature Engineering ¶

  1. Avg amount by agency, merchant
  2. Avg amount by agency, merchant in last 30 days
  3. Avg amount by agency, merchant in last 60 days
  4. Avg amount by agency, merchant in last 90 days
  5. Max amount by agency, merchant in last 30 days
  6. Max amount by agency, merchant in last 60 days
  7. Max amount by agency, merchant in last 90 days
  8. Avg transaction by agency, merchant
  9. Avg transaction by agency, merchant in last 30 days
  10. Avg transaction by agency, merchant in last 60 days
  11. Avg transaction by agency, merchant in last 90 days
  12. Avg among by agency, vendor
  13. Avg amount by agency, vendor in last 30 days
  14. Avg amount by agency, vendor in last 60 days
  15. Avg amount by agency, vendor in last 90 days

Features mainly focused on 4 main categories: Average Amount by Agency and Merchant, Maximum Amount by Agency and Merchant, Average Transaction Count by Agency and Merchant, Average Amount by Agency and Vendor. Within each category, we will look at the overall transaction data and across different time period i.e. in last 30, 60, 90 days to identify spending patterns and potential anomalies.

3.1 Feature Benchmarks¶

In [12]:
# Define functions and variables for benchmark calculations
def calculate_avg_amount(data, group_by_cols):
    return data.groupby(group_by_cols)['Amount'].mean().reset_index(name='Avg_Amount')

def calculate_max_amount(data, group_by_cols):
    return data.groupby(group_by_cols)['Amount'].max().reset_index(name='Max_Amount')

def calculate_avg_transaction(data, group_by_cols):
    return data.groupby(group_by_cols).size().reset_index(name='Avg_Transaction_Count')

def calculate_last_n_days(data, n):
    cutoff_date = data['Transaction_Date'].max() - pd.Timedelta(days=n)
    return data[data['Transaction_Date'] > cutoff_date]

# Define the group by columns
agency_merchant_cols = ['Agency_Name', 'Merchant_Category']
agency_vendor_cols = ['Agency_Name', 'Vendor']
In [13]:
# Calculate features
features = {}

# Feature 1: Average Amount by Agency, Merchant
features['avg_amount'] = calculate_avg_amount(df, agency_merchant_cols)
print("Feature 1: Average Amount by Agency, Merchant")
print(features['avg_amount'].head(),'\n')

# Feature 2: Average Amount by Agency, Merchant in last 30 days
features['avg_amount_30_days'] = calculate_avg_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
print("Feature 2: Average Amount by Agency, Merchant in last 30 days")
print(features['avg_amount_30_days'].head(),'\n')

# Feature 3: Average Amount by Agency, Merchant in last 60 days
features['avg_amount_60_days'] = calculate_avg_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
print("Feature 3: Average Amount by Agency, Merchant in last 60 days")
print(features['avg_amount_60_days'].head(),'\n')

# Feature 4: Average Amount by Agency, Merchant in last 90 days
features['avg_amount_90_days'] = calculate_avg_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
print("Feature 4: Average Amount by Agency, Merchant in last 90 days")
print(features['avg_amount_90_days'].head(),'\n')

# Feature 5: Maximum Amount by Agency, Merchant in last 30 days
features['max_amount_30_days'] = calculate_max_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
print("Feature 5: Maximum Amount by Agency, Merchant in last 30 days")
print(features['max_amount_30_days'].head(),'\n')

# Feature 6: Maximum Amount by Agency, Merchant in last 60 days
features['max_amount_60_days'] = calculate_max_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
print("Feature 6: Maximum Amount by Agency, Merchant in last 60 days")
print(features['max_amount_60_days'].head(),'\n')

# Feature 7: Maximum Amount by Agency, Merchant in last 90 days
features['max_amount_90_days'] = calculate_max_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
print("Feature 7: Maximum Amount by Agency, Merchant in last 90 days")
print(features['max_amount_90_days'].head(),'\n')

# Feature 8: Average Transaction by Agency, Merchant
features['avg_transaction'] = calculate_avg_transaction(df, agency_merchant_cols)
print("Feature 8: Average Transaction by Agency, Merchant")
print(features['avg_transaction'].head(),'\n')

# Feature 9: Average Transaction by Agency, Merchant in last 30 days
features['avg_transaction_30_days'] = calculate_avg_transaction(calculate_last_n_days(df, 30), agency_merchant_cols)
print("Feature 9: Average Transaction by Agency, Merchant in last 30 days")
print(features['avg_transaction_30_days'].head(),'\n')

# Feature 10: Average Transaction by Agency, Merchant in last 60 days
features['avg_transaction_60_days'] = calculate_avg_transaction(calculate_last_n_days(df, 60), agency_merchant_cols)
print("Feature 10: Average Transaction by Agency, Merchant in last 60 days")
print(features['avg_transaction_60_days'].head(),'\n')

# Feature 11: Average Transaction by Agency, Merchant in last 90 days
features['avg_transaction_90_days'] = calculate_avg_transaction(calculate_last_n_days(df, 90), agency_merchant_cols)
print("Feature 11: Average Transaction by Agency, Merchant in last 90 days")
print(features['avg_transaction_90_days'].head(),'\n')

# Feature 12: Average Amount by Agency, Vendor
features['avg_amount_vendor'] = calculate_avg_amount(df, agency_vendor_cols)
print("Feature 12: Average Amount by Agency, Vendor")
print(features['avg_amount_vendor'].head(),'\n')

# Feature 13: Average Amount by Agency, Vendor in last 30 days
features['avg_amount_vendor_30_days'] = calculate_avg_amount(calculate_last_n_days(df, 30), agency_vendor_cols)
print("Feature 13: Average Amount by Agency, Vendor in last 30 days")
print(features['avg_amount_vendor_30_days'].head(),'\n')

# Feature 14: Average Amount by Agency, Vendor in last 60 days
features['avg_amount_vendor_60_days'] = calculate_avg_amount(calculate_last_n_days(df, 60), agency_vendor_cols)
print("Feature 14: Average Amount by Agency, Vendor in last 60 days")
print(features['avg_amount_vendor_60_days'].head(),'\n')

# Feature 15: Average Amount by Agency, Vendor in last 90 days
features['avg_amount_vendor_90_days'] = calculate_avg_amount(calculate_last_n_days(df, 90), agency_vendor_cols)
print("Feature 15: Average Amount by Agency, Vendor in last 90 days")
print(features['avg_amount_vendor_90_days'].head())
Feature 1: Average Amount by Agency, Merchant
                       Agency_Name  \
0  ARDMORE HIGHER EDUCATION CENTER   
1  ARDMORE HIGHER EDUCATION CENTER   
2  ARDMORE HIGHER EDUCATION CENTER   
3  ARDMORE HIGHER EDUCATION CENTER   
4  ARDMORE HIGHER EDUCATION CENTER   

                            Merchant_Category  Avg_Amount  
0  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED  115.120000  
1                           CATALOG MERCHANTS  -53.344286  
2       COMPUTER NETWORK/INFORMATION SERVICES   18.340000  
3                  DRUG STORES AND PHARMACIES  153.450000  
4             GROCERY STORES,AND SUPERMARKETS   74.525455   

Feature 2: Average Amount by Agency, Merchant in last 30 days
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
3  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   
4  ATTORNEY GENERAL        CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS   

    Avg_Amount  
0  1040.000000  
1   462.781429  
2   436.428571  
3    24.590000  
4   875.000000   

Feature 3: Average Amount by Agency, Merchant in last 60 days
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL                                        BOOK STORES   
3  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
4  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   

    Avg_Amount  
0  1040.000000  
1   465.248125  
2   123.225000  
3   332.034545  
4   118.990000   

Feature 4: Average Amount by Agency, Merchant in last 90 days
        Agency_Name                           Merchant_Category   Avg_Amount
0  ATTORNEY GENERAL                        ADVERTISING SERVICES  1040.000000
1  ATTORNEY GENERAL                           AMERICAN AIRLINES   417.765667
2  ATTORNEY GENERAL     AUTOMOTIVE PARTS AND ACCESSORIES STORES    97.215000
3  ATTORNEY GENERAL                                 BOOK STORES   123.225000
4  ATTORNEY GENERAL  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   299.336923 

Feature 5: Maximum Amount by Agency, Merchant in last 30 days
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
3  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   
4  ATTORNEY GENERAL        CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS   

   Max_Amount  
0     1040.00  
1      982.00  
2     1000.00  
3       24.59  
4     1325.00   

Feature 6: Maximum Amount by Agency, Merchant in last 60 days
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL                                        BOOK STORES   
3  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
4  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   

   Max_Amount  
0     1040.00  
1      982.00  
2      208.50  
3     1000.00  
4      306.66   

Feature 7: Maximum Amount by Agency, Merchant in last 90 days
        Agency_Name                           Merchant_Category  Max_Amount
0  ATTORNEY GENERAL                        ADVERTISING SERVICES     1040.00
1  ATTORNEY GENERAL                           AMERICAN AIRLINES      982.00
2  ATTORNEY GENERAL     AUTOMOTIVE PARTS AND ACCESSORIES STORES      186.17
3  ATTORNEY GENERAL                                 BOOK STORES      208.50
4  ATTORNEY GENERAL  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED     1000.00 

Feature 8: Average Transaction by Agency, Merchant
                       Agency_Name  \
0  ARDMORE HIGHER EDUCATION CENTER   
1  ARDMORE HIGHER EDUCATION CENTER   
2  ARDMORE HIGHER EDUCATION CENTER   
3  ARDMORE HIGHER EDUCATION CENTER   
4  ARDMORE HIGHER EDUCATION CENTER   

                            Merchant_Category  Avg_Transaction_Count  
0  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED                      1  
1                           CATALOG MERCHANTS                      7  
2       COMPUTER NETWORK/INFORMATION SERVICES                      1  
3                  DRUG STORES AND PHARMACIES                      1  
4             GROCERY STORES,AND SUPERMARKETS                     11   

Feature 9: Average Transaction by Agency, Merchant in last 30 days
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
3  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   
4  ATTORNEY GENERAL        CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS   

   Avg_Transaction_Count  
0                      1  
1                      7  
2                      7  
3                      1  
4                      3   

Feature 10: Average Transaction by Agency, Merchant in last 60 days
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL                                        BOOK STORES   
3  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
4  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   

   Avg_Transaction_Count  
0                      1  
1                     16  
2                      2  
3                     11  
4                      3   

Feature 11: Average Transaction by Agency, Merchant in last 90 days
        Agency_Name                           Merchant_Category  \
0  ATTORNEY GENERAL                        ADVERTISING SERVICES   
1  ATTORNEY GENERAL                           AMERICAN AIRLINES   
2  ATTORNEY GENERAL     AUTOMOTIVE PARTS AND ACCESSORIES STORES   
3  ATTORNEY GENERAL                                 BOOK STORES   
4  ATTORNEY GENERAL  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   

   Avg_Transaction_Count  
0                      1  
1                     30  
2                      2  
3                      2  
4                     13   

Feature 12: Average Amount by Agency, Vendor
                       Agency_Name                Vendor  Avg_Amount
0  ARDMORE HIGHER EDUCATION CENTER         AGRI PRODUCTS   92.500000
1  ARDMORE HIGHER EDUCATION CENTER        CDW GOVERNMENT  -53.344286
2  ARDMORE HIGHER EDUCATION CENTER    COUNTY BUILDING CE   58.440000
3  ARDMORE HIGHER EDUCATION CENTER  EMPIRE PAPER COMPANY  443.540000
4  ARDMORE HIGHER EDUCATION CENTER           GODADDY.COM   18.340000 

Feature 13: Average Amount by Agency, Vendor in last 30 days
        Agency_Name                     Vendor  Avg_Amount
0  ATTORNEY GENERAL       ADI ASPEN PUBLISHERS      866.20
1  ATTORNEY GENERAL  AMERICAN AI 0017450939750     -250.03
2  ATTORNEY GENERAL  AMERICAN AI 0017456015142      982.00
3  ATTORNEY GENERAL  AMERICAN AI 0017459356062      669.00
4  ATTORNEY GENERAL  AMERICAN AI 0017459356079      687.50 

Feature 14: Average Amount by Agency, Vendor in last 60 days
        Agency_Name                     Vendor  Avg_Amount
0  ATTORNEY GENERAL       ADI ASPEN PUBLISHERS     866.200
1  ATTORNEY GENERAL       AMAZON MKTPLACE PMTS     123.225
2  ATTORNEY GENERAL  AMERICAN AI 0017450939750     205.985
3  ATTORNEY GENERAL  AMERICAN AI 0017450939785    -361.000
4  ATTORNEY GENERAL  AMERICAN AI 0017450939901     250.000 

Feature 15: Average Amount by Agency, Vendor in last 90 days
        Agency_Name                     Vendor  Avg_Amount
0  ATTORNEY GENERAL       ADI ASPEN PUBLISHERS     866.200
1  ATTORNEY GENERAL       AMAZON MKTPLACE PMTS     123.225
2  ATTORNEY GENERAL  AMERICAN AI 0017394725288     444.000
3  ATTORNEY GENERAL  AMERICAN AI 0017395886798     673.000
4  ATTORNEY GENERAL  AMERICAN AI 0017395886825       0.000

3.2 Feature Creation¶

In [14]:
# Calculate ratios between features
def calculate_ratios(features):
    ratios = {}
    
    # Define base features for ratio calculations (using overall avg amount as a baseline for simplicity)
    baseline_avg_amount = features['avg_amount']['Avg_Amount'].mean()
    baseline_transaction_count = features['avg_transaction']['Avg_Transaction_Count'].mean()
    baseline_avg_amount_vendor = features['avg_amount_vendor']['Avg_Amount'].mean()
    
    # Calculate ratios for average amounts
    ratios['avg_amount_ratio'] = features['avg_amount'].copy()
    if baseline_avg_amount != 0:
        ratios['avg_amount_ratio']['Avg_Amount_Ratio'] = features['avg_amount']['Avg_Amount'] / baseline_avg_amount
    else:
        ratios['avg_amount_ratio']['Avg_Amount_Ratio'] = 0
    
    ratios['avg_amount_30_days_ratio'] = features['avg_amount_30_days'].copy()
    if baseline_avg_amount != 0:
        ratios['avg_amount_30_days_ratio']['Avg_Amount_30_Days_Ratio'] = features['avg_amount_30_days']['Avg_Amount'] / baseline_avg_amount
    else:
        ratios['avg_amount_30_days_ratio']['Avg_Amount_30_Days_Ratio'] = 0
    
    ratios['avg_amount_60_days_ratio'] = features['avg_amount_60_days'].copy()
    if baseline_avg_amount != 0:
        ratios['avg_amount_60_days_ratio']['Avg_Amount_60_Days_Ratio'] = features['avg_amount_60_days']['Avg_Amount'] / baseline_avg_amount
    else:
        ratios['avg_amount_60_days_ratio']['Avg_Amount_60_Days_Ratio'] = 0
    
    ratios['avg_amount_90_days_ratio'] = features['avg_amount_90_days'].copy()
    if baseline_avg_amount != 0:
        ratios['avg_amount_90_days_ratio']['Avg_Amount_90_Days_Ratio'] = features['avg_amount_90_days']['Avg_Amount'] / baseline_avg_amount
    else:
        ratios['avg_amount_90_days_ratio']['Avg_Amount_90_Days_Ratio'] = 0
    
    # Calculate ratios for maximum amounts
    ratios['max_amount_30_days_ratio'] = features['max_amount_30_days'].copy()
    if baseline_avg_amount != 0:
        ratios['max_amount_30_days_ratio']['Max_Amount_30_Days_Ratio'] = features['max_amount_30_days']['Max_Amount'] / baseline_avg_amount
    else:
        ratios['max_amount_30_days_ratio']['Max_Amount_30_Days_Ratio'] = 0
    
    ratios['max_amount_60_days_ratio'] = features['max_amount_60_days'].copy()
    if baseline_avg_amount != 0:
        ratios['max_amount_60_days_ratio']['Max_Amount_60_Days_Ratio'] = features['max_amount_60_days']['Max_Amount'] / baseline_avg_amount
    else:
        ratios['max_amount_60_days_ratio']['Max_Amount_60_Days_Ratio'] = 0
    
    ratios['max_amount_90_days_ratio'] = features['max_amount_90_days'].copy()
    if baseline_avg_amount != 0:
        ratios['max_amount_90_days_ratio']['Max_Amount_90_Days_Ratio'] = features['max_amount_90_days']['Max_Amount'] / baseline_avg_amount
    else:
        ratios['max_amount_90_days_ratio']['Max_Amount_90_Days_Ratio'] = 0
    
    # Calculate ratios for average transaction counts
    ratios['avg_transaction_ratio'] = features['avg_transaction'].copy()
    if baseline_transaction_count != 0:
        ratios['avg_transaction_ratio']['Avg_Transaction_Ratio'] = features['avg_transaction']['Avg_Transaction_Count'] / baseline_transaction_count
    else:
        ratios['avg_transaction_ratio']['Avg_Transaction_Ratio'] = 0
    
    ratios['avg_transaction_30_days_ratio'] = features['avg_transaction_30_days'].copy()
    if baseline_transaction_count != 0:
        ratios['avg_transaction_30_days_ratio']['Avg_Transaction_30_Days_Ratio'] = features['avg_transaction_30_days']['Avg_Transaction_Count'] / baseline_transaction_count
    else:
        ratios['avg_transaction_30_days_ratio']['Avg_Transaction_30_Days_Ratio'] = 0
    
    ratios['avg_transaction_60_days_ratio'] = features['avg_transaction_60_days'].copy()
    if baseline_transaction_count != 0:
        ratios['avg_transaction_60_days_ratio']['Avg_Transaction_60_Days_Ratio'] = features['avg_transaction_60_days']['Avg_Transaction_Count'] / baseline_transaction_count
    else:
        ratios['avg_transaction_60_days_ratio']['Avg_Transaction_60_Days_Ratio'] = 0
    
    ratios['avg_transaction_90_days_ratio'] = features['avg_transaction_90_days'].copy()
    if baseline_transaction_count != 0:
        ratios['avg_transaction_90_days_ratio']['Avg_Transaction_90_Days_Ratio'] = features['avg_transaction_90_days']['Avg_Transaction_Count'] / baseline_transaction_count
    else:
        ratios['avg_transaction_90_days_ratio']['Avg_Transaction_90_Days_Ratio'] = 0
    
    # Calculate ratios for average amounts by vendor
    ratios['avg_amount_vendor_ratio'] = features['avg_amount_vendor'].copy()
    if baseline_avg_amount_vendor != 0:
        ratios['avg_amount_vendor_ratio']['Avg_Amount_Vendor_Ratio'] = features['avg_amount_vendor']['Avg_Amount'] / baseline_avg_amount_vendor
    else:
        ratios['avg_amount_vendor_ratio']['Avg_Amount_Vendor_Ratio'] = 0
    
    ratios['avg_amount_vendor_30_days_ratio'] = features['avg_amount_vendor_30_days'].copy()
    if baseline_avg_amount_vendor != 0:
        ratios['avg_amount_vendor_30_days_ratio']['Avg_Amount_Vendor_30_Days_Ratio'] = features['avg_amount_vendor_30_days']['Avg_Amount'] / baseline_avg_amount_vendor
    else:
        ratios['avg_amount_vendor_30_days_ratio']['Avg_Amount_Vendor_30_Days_Ratio'] = 0
    
    ratios['avg_amount_vendor_60_days_ratio'] = features['avg_amount_vendor_60_days'].copy()
    if baseline_avg_amount_vendor != 0:
        ratios['avg_amount_vendor_60_days_ratio']['Avg_Amount_Vendor_60_Days_Ratio'] = features['avg_amount_vendor_60_days']['Avg_Amount'] / baseline_avg_amount_vendor
    else:
        ratios['avg_amount_vendor_60_days_ratio']['Avg_Amount_Vendor_60_Days_Ratio'] = 0
    
    ratios['avg_amount_vendor_90_days_ratio'] = features['avg_amount_vendor_90_days'].copy()
    if baseline_avg_amount_vendor != 0:
        ratios['avg_amount_vendor_90_days_ratio']['Avg_Amount_Vendor_90_Days_Ratio'] = features['avg_amount_vendor_90_days']['Avg_Amount'] / baseline_avg_amount_vendor
    else:
        ratios['avg_amount_vendor_90_days_ratio']['Avg_Amount_Vendor_90_Days_Ratio'] = 0
    
    return ratios

# Calculate and print ratios
ratios = calculate_ratios(features)

# Displaying ratios for validation
for key, value in ratios.items():
    print(f"Ratios for {key}")
    print(value.head(),'\n')
Ratios for avg_amount_ratio
                       Agency_Name  \
0  ARDMORE HIGHER EDUCATION CENTER   
1  ARDMORE HIGHER EDUCATION CENTER   
2  ARDMORE HIGHER EDUCATION CENTER   
3  ARDMORE HIGHER EDUCATION CENTER   
4  ARDMORE HIGHER EDUCATION CENTER   

                            Merchant_Category  Avg_Amount  Avg_Amount_Ratio  
0  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED  115.120000          0.218589  
1                           CATALOG MERCHANTS  -53.344286         -0.101290  
2       COMPUTER NETWORK/INFORMATION SERVICES   18.340000          0.034824  
3                  DRUG STORES AND PHARMACIES  153.450000          0.291370  
4             GROCERY STORES,AND SUPERMARKETS   74.525455          0.141509   

Ratios for avg_amount_30_days_ratio
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
3  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   
4  ATTORNEY GENERAL        CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS   

    Avg_Amount  Avg_Amount_30_Days_Ratio  
0  1040.000000                  1.974746  
1   462.781429                  0.878727  
2   436.428571                  0.828688  
3    24.590000                  0.046691  
4   875.000000                  1.661445   

Ratios for avg_amount_60_days_ratio
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL                                        BOOK STORES   
3  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
4  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   

    Avg_Amount  Avg_Amount_60_Days_Ratio  
0  1040.000000                  1.974746  
1   465.248125                  0.883410  
2   123.225000                  0.233979  
3   332.034545                  0.630465  
4   118.990000                  0.225938   

Ratios for avg_amount_90_days_ratio
        Agency_Name                           Merchant_Category   Avg_Amount  \
0  ATTORNEY GENERAL                        ADVERTISING SERVICES  1040.000000   
1  ATTORNEY GENERAL                           AMERICAN AIRLINES   417.765667   
2  ATTORNEY GENERAL     AUTOMOTIVE PARTS AND ACCESSORIES STORES    97.215000   
3  ATTORNEY GENERAL                                 BOOK STORES   123.225000   
4  ATTORNEY GENERAL  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   299.336923   

   Avg_Amount_90_Days_Ratio  
0                  1.974746  
1                  0.793251  
2                  0.184591  
3                  0.233979  
4                  0.568379   

Ratios for max_amount_30_days_ratio
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
3  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   
4  ATTORNEY GENERAL        CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS   

   Max_Amount  Max_Amount_30_Days_Ratio  
0     1040.00                  1.974746  
1      982.00                  1.864616  
2     1000.00                  1.898794  
3       24.59                  0.046691  
4     1325.00                  2.515902   

Ratios for max_amount_60_days_ratio
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL                                        BOOK STORES   
3  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
4  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   

   Max_Amount  Max_Amount_60_Days_Ratio  
0     1040.00                  1.974746  
1      982.00                  1.864616  
2      208.50                  0.395899  
3     1000.00                  1.898794  
4      306.66                  0.582284   

Ratios for max_amount_90_days_ratio
        Agency_Name                           Merchant_Category  Max_Amount  \
0  ATTORNEY GENERAL                        ADVERTISING SERVICES     1040.00   
1  ATTORNEY GENERAL                           AMERICAN AIRLINES      982.00   
2  ATTORNEY GENERAL     AUTOMOTIVE PARTS AND ACCESSORIES STORES      186.17   
3  ATTORNEY GENERAL                                 BOOK STORES      208.50   
4  ATTORNEY GENERAL  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED     1000.00   

   Max_Amount_90_Days_Ratio  
0                  1.974746  
1                  1.864616  
2                  0.353499  
3                  0.395899  
4                  1.898794   

Ratios for avg_transaction_ratio
                       Agency_Name  \
0  ARDMORE HIGHER EDUCATION CENTER   
1  ARDMORE HIGHER EDUCATION CENTER   
2  ARDMORE HIGHER EDUCATION CENTER   
3  ARDMORE HIGHER EDUCATION CENTER   
4  ARDMORE HIGHER EDUCATION CENTER   

                            Merchant_Category  Avg_Transaction_Count  \
0  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED                      1   
1                           CATALOG MERCHANTS                      7   
2       COMPUTER NETWORK/INFORMATION SERVICES                      1   
3                  DRUG STORES AND PHARMACIES                      1   
4             GROCERY STORES,AND SUPERMARKETS                     11   

   Avg_Transaction_Ratio  
0               0.019046  
1               0.133321  
2               0.019046  
3               0.019046  
4               0.209505   

Ratios for avg_transaction_30_days_ratio
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
3  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   
4  ATTORNEY GENERAL        CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS   

   Avg_Transaction_Count  Avg_Transaction_30_Days_Ratio  
0                      1                       0.019046  
1                      7                       0.133321  
2                      7                       0.133321  
3                      1                       0.019046  
4                      3                       0.057138   

Ratios for avg_transaction_60_days_ratio
        Agency_Name                                  Merchant_Category  \
0  ATTORNEY GENERAL                               ADVERTISING SERVICES   
1  ATTORNEY GENERAL                                  AMERICAN AIRLINES   
2  ATTORNEY GENERAL                                        BOOK STORES   
3  ATTORNEY GENERAL         BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   
4  ATTORNEY GENERAL  CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...   

   Avg_Transaction_Count  Avg_Transaction_60_Days_Ratio  
0                      1                       0.019046  
1                     16                       0.304734  
2                      2                       0.038092  
3                     11                       0.209505  
4                      3                       0.057138   

Ratios for avg_transaction_90_days_ratio
        Agency_Name                           Merchant_Category  \
0  ATTORNEY GENERAL                        ADVERTISING SERVICES   
1  ATTORNEY GENERAL                           AMERICAN AIRLINES   
2  ATTORNEY GENERAL     AUTOMOTIVE PARTS AND ACCESSORIES STORES   
3  ATTORNEY GENERAL                                 BOOK STORES   
4  ATTORNEY GENERAL  BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED   

   Avg_Transaction_Count  Avg_Transaction_90_Days_Ratio  
0                      1                       0.019046  
1                     30                       0.571376  
2                      2                       0.038092  
3                      2                       0.038092  
4                     13                       0.247596   

Ratios for avg_amount_vendor_ratio
                       Agency_Name                Vendor  Avg_Amount  \
0  ARDMORE HIGHER EDUCATION CENTER         AGRI PRODUCTS   92.500000   
1  ARDMORE HIGHER EDUCATION CENTER        CDW GOVERNMENT  -53.344286   
2  ARDMORE HIGHER EDUCATION CENTER    COUNTY BUILDING CE   58.440000   
3  ARDMORE HIGHER EDUCATION CENTER  EMPIRE PAPER COMPANY  443.540000   
4  ARDMORE HIGHER EDUCATION CENTER           GODADDY.COM   18.340000   

   Avg_Amount_Vendor_Ratio  
0                 0.206872  
1                -0.119302  
2                 0.130699  
3                 0.991959  
4                 0.041017   

Ratios for avg_amount_vendor_30_days_ratio
        Agency_Name                     Vendor  Avg_Amount  \
0  ATTORNEY GENERAL       ADI ASPEN PUBLISHERS      866.20   
1  ATTORNEY GENERAL  AMERICAN AI 0017450939750     -250.03   
2  ATTORNEY GENERAL  AMERICAN AI 0017456015142      982.00   
3  ATTORNEY GENERAL  AMERICAN AI 0017459356062      669.00   
4  ATTORNEY GENERAL  AMERICAN AI 0017459356079      687.50   

   Avg_Amount_Vendor_30_Days_Ratio  
0                         1.937220  
1                        -0.559182  
2                         2.196202  
3                         1.496191  
4                         1.537565   

Ratios for avg_amount_vendor_60_days_ratio
        Agency_Name                     Vendor  Avg_Amount  \
0  ATTORNEY GENERAL       ADI ASPEN PUBLISHERS     866.200   
1  ATTORNEY GENERAL       AMAZON MKTPLACE PMTS     123.225   
2  ATTORNEY GENERAL  AMERICAN AI 0017450939750     205.985   
3  ATTORNEY GENERAL  AMERICAN AI 0017450939785    -361.000   
4  ATTORNEY GENERAL  AMERICAN AI 0017450939901     250.000   

   Avg_Amount_Vendor_60_Days_Ratio  
0                         1.937220  
1                         0.275588  
2                         0.460677  
3                        -0.807361  
4                         0.559115   

Ratios for avg_amount_vendor_90_days_ratio
        Agency_Name                     Vendor  Avg_Amount  \
0  ATTORNEY GENERAL       ADI ASPEN PUBLISHERS     866.200   
1  ATTORNEY GENERAL       AMAZON MKTPLACE PMTS     123.225   
2  ATTORNEY GENERAL  AMERICAN AI 0017394725288     444.000   
3  ATTORNEY GENERAL  AMERICAN AI 0017395886798     673.000   
4  ATTORNEY GENERAL  AMERICAN AI 0017395886825       0.000   

   Avg_Amount_Vendor_90_Days_Ratio  
0                         1.937220  
1                         0.275588  
2                         0.992987  
3                         1.505136  
4                         0.000000   

3.3 Descriptive Statistics of Features¶

In [15]:
# Function to display descriptive statistics of ratios
def display_ratio_stats(ratios):
    stats = {}
    for key, value in ratios.items():
        stats[key] = value.describe()
    
    # Combine statistics into a single dataframe for easier viewing
    stats_df = pd.concat(stats, axis=1)
    
    return stats_df

# Display descriptive statistics of ratios
ratio_stats = display_ratio_stats(ratios)
print("Descriptive Statistics of Ratios:", '\n')
print(ratio_stats)
Descriptive Statistics of Ratios: 

      avg_amount_ratio                  avg_amount_30_days_ratio  \
            Avg_Amount Avg_Amount_Ratio               Avg_Amount   
count      8427.000000      8427.000000              3372.000000   
mean        526.650020         1.000000               529.500247   
std        2412.571766         4.580977              1304.567007   
min       -3047.263333        -5.786126             -2393.700000   
25%         111.995000         0.212655                86.884113   
50%         269.450000         0.511630               240.048810   
75%         527.840500         1.002260               556.442353   
max      171619.610000       325.870319             45784.640000   

                               avg_amount_60_days_ratio  \
      Avg_Amount_30_Days_Ratio               Avg_Amount   
count              3372.000000              4420.000000   
mean                  1.005412               542.288064   
std                   2.477104              2266.461185   
min                  -4.545144              -995.600000   
25%                   0.164975                96.037500   
50%                   0.455803               265.452500   
75%                   1.056570               551.635076   
max                  86.935609            134734.050000   

                               avg_amount_90_days_ratio  \
      Avg_Amount_60_Days_Ratio               Avg_Amount   
count              4420.000000              5258.000000   
mean                  1.029693               538.989403   
std                   4.303543              3089.953297   
min                  -1.890440              -842.720000   
25%                   0.182355               102.350694   
50%                   0.504040               270.000000   
75%                   1.047441               555.795938   
max                 255.832232            214206.120000   

                               max_amount_30_days_ratio  \
      Avg_Amount_90_Days_Ratio               Max_Amount   
count              5258.000000              3372.000000   
mean                  1.023430              1552.831412   
std                   5.867185              5522.956927   
min                  -1.600152             -2393.700000   
25%                   0.194343               118.747500   
50%                   0.512674               425.155000   
75%                   1.055342              1188.075000   
max                 406.733337            132790.140000   

                                ... avg_transaction_90_days_ratio  \
      Max_Amount_30_Days_Ratio  ...         Avg_Transaction_Count   
count              3372.000000  ...                   5258.000000   
mean                  2.948507  ...                     21.052301   
std                  10.486959  ...                     93.690020   
min                  -4.545144  ...                      1.000000   
25%                   0.225477  ...                      1.000000   
50%                   0.807282  ...                      3.000000   
75%                   2.255910  ...                     10.000000   
max                 252.141147  ...                   2146.000000   

                                    avg_amount_vendor_ratio  \
      Avg_Transaction_90_Days_Ratio              Avg_Amount   
count                   5258.000000            1.119170e+05   
mean                       0.400960            4.471355e+02   
std                        1.784408            6.534493e+03   
min                        0.019046           -4.907220e+03   
25%                        0.019046            2.900000e+01   
50%                        0.057138            1.561667e+02   
75%                        0.190459            4.560000e+02   
max                       40.872449            1.827119e+06   

                              avg_amount_vendor_30_days_ratio  \
      Avg_Amount_Vendor_Ratio                      Avg_Amount   
count           111917.000000                    14965.000000   
mean                 1.000000                      507.333145   
std                 14.614122                     1852.339922   
min                -10.974793                    -4987.040000   
25%                  0.064857                       40.680000   
50%                  0.349260                      162.802778   
75%                  1.019825                      474.000000   
max               4086.275878                    60347.380000   

                                      avg_amount_vendor_60_days_ratio  \
      Avg_Amount_Vendor_30_Days_Ratio                      Avg_Amount   
count                    14965.000000                    26124.000000   
mean                         1.134629                      492.806725   
std                          4.142681                     2992.894701   
min                        -11.153307                    -4999.000000   
25%                          0.090979                       37.000000   
50%                          0.364102                      162.000000   
75%                          1.060081                      475.000000   
max                        134.964400                   348053.750000   

                                      avg_amount_vendor_90_days_ratio  \
      Avg_Amount_Vendor_60_Days_Ratio                      Avg_Amount   
count                    26124.000000                    37443.000000   
mean                         1.102142                      467.539420   
std                          6.693484                     2442.289144   
min                        -11.180055                    -3889.570000   
25%                          0.082749                       36.185000   
50%                          0.362306                      164.430000   
75%                          1.062318                      478.858333   
max                        778.407704                   298416.860000   

                                       
      Avg_Amount_Vendor_90_Days_Ratio  
count                    37443.000000  
mean                         1.045632  
std                          5.462078  
min                         -8.698861  
25%                          0.080926  
50%                          0.367741  
75%                          1.070947  
max                        667.396869  

[8 rows x 30 columns]

3.4 Distribution of Features¶

In [16]:
# Define function to create and plot percentile bins
def plot_percentile_bins(df, var):
    binned_var = var + '_bin'
    percentile = [0, 0.01, 0.05, 0.2, 0.5, 0.8, 0.95, 0.99, 1]
    df[binned_var] = pd.qcut(df[var], percentile, duplicates='drop')
    
    # Create the bin labels
    df['bin_label'] = df[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
    
    # Count the occurrences in each bin
    bin_counts = df['bin_label'].value_counts().sort_index().reset_index()
    bin_counts.columns = ['bin_label', 'count']
    
    # Create the plot
    fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
                 color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
    
    # Update the layout
    fig.update_layout(
        title=f'Distribution of {var} Ratio',
        xaxis_title='Quantile Range',
        yaxis_title='Count',
        template='plotly_white'
    )
    
    # Show the plot
    fig.show()

# Plot all ratio columns
for key, df_ratio in ratios.items():
    ratio_col = df_ratio.columns[-1]
    plot_percentile_bins(df_ratio, ratio_col)

Section 4: Histogram-Based Outlier Score (HBOS)¶

Approach:

  • HBOS uses histograms to model the distribution of each feature independently. It divides the range of each feature into bins and computes the density of data points in each bin.

Scoring:

  • The anomaly score is based on the inverse of the density; regions with lower density (fewer data points) are considered more anomalous.

Advantages:

  • Simple and fast, particularly effective for univariate data or when features are independent.

Limitations:

  • May not capture interactions between features effectively. Sensitive to bin width and outliers.
In [17]:
# Define functions
def count_stat(vector):
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

def descriptive_stat_threshold(df, pred_scores, threshold):
    df['Anomaly_Score'] = pred_scores
    df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')

    # Calculate count and percentage
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100

    # Calculate mean statistics
    stat = df.groupby('Group').mean().round(2).reset_index()

    # Merge count and mean statistics
    stat = cnt.merge(stat, on='Group')
    return stat

# Initialize parameters for hyperparameter tuning
param_grid = {'n_bins': [600, 700, 800], 'contamination': [0.01, 0.02, 0.03, 0.04, 0.05], 'alpha': [0.1, 0.2, 0.5, 1.0]}
In [18]:
# Prepare to store results for each ratio
descriptive_stats = {}

# Apply HBOS to each ratio DataFrame and calculate descriptive statistics
for key, df in ratios.items():
    ratio_col = [col for col in df.columns if col.endswith('Ratio')][0]

    # Prepare the data for HBOS
    X = df[[ratio_col]].values

    # Split data into train and test sets
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

    # Standardize data
    X_train_norm, X_test_norm = standardizer(X_train, X_test)

    # Initialize HBOS model
    hbos = HBOS()

    # Perform grid search for hyperparameter tuning
    grid_search = GridSearchCV(estimator=hbos, param_grid=param_grid, scoring='roc_auc')
    grid_search.fit(X_train_norm)

    # Get the best hyperparameters
    best_params = grid_search.best_params_

    # Use the best hyperparameters to train the final HBOS model
    best_hbos = HBOS(**best_params)
    best_hbos.fit(X_train_norm)

    # Get the anomaly scores and predictions
    y_train_scores = best_hbos.decision_function(X_train_norm)
    y_train_pred = best_hbos.predict(X_train_norm)
    y_test_scores = best_hbos.decision_function(X_test_norm)
    y_test_pred = best_hbos.predict(X_test_norm)

    # Plot histogram of training scores using Plotly
    fig = make_subplots(rows=1, cols=2, subplot_titles=('Training Data', 'Test Data'))

    fig.add_trace(
        go.Histogram(x=y_train_scores, name='Training'),
        row=1, col=1
    )

    fig.add_trace(
        go.Histogram(x=y_test_scores, name='Test'),
        row=1, col=2
    )

    fig.update_layout(
        title=f"Outlier score for {ratio_col}",
        xaxis_title="Outlier Score",
        yaxis_title="Frequency",
        showlegend=True,
        template='seaborn',
        height = 450, width = 1000,
        paper_bgcolor='rgba(0,0,0,0)'
    )
    
    fig.show()
    
    # Print threshold and count statistics
    print(f"Best hyperparameters for {ratio_col}: {best_params}")
    threshold = best_hbos.threshold_
    print(f"The threshold for {ratio_col}: {threshold}")
    print(f"The training data (counts) for {ratio_col}: {count_stat(y_train_pred)}")
    print(f"The test data (counts) for {ratio_col}: {count_stat(y_test_pred)}", '\n')
    
    # Get the descriptive statistics for outliers
    stats_train = descriptive_stat_threshold(pd.DataFrame(X_train, columns=[ratio_col]), y_train_scores, threshold)
    stats_test = descriptive_stat_threshold(pd.DataFrame(X_test, columns=[ratio_col]), y_test_scores, threshold)

    # Store the statistics
    descriptive_stats[ratio_col] = {'train': stats_train, 'test': stats_test}

# Print the descriptive statistics for each ratio
for ratio_col, stats in descriptive_stats.items():
    print(f"Descriptive Statistics for {ratio_col} (Training Data):\n", stats['train'], "\n")
    print(f"Descriptive Statistics for {ratio_col} (Test Data):\n", stats['test'], "\n")
Best hyperparameters for Avg_Amount_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Ratio: 3.155841153404613
The training data (counts) for Avg_Amount_Ratio: {0: 6677, 1: 64}
The test data (counts) for Avg_Amount_Ratio: {0: 1675, 1: 11} 

Best hyperparameters for Avg_Amount_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_30_Days_Ratio: 3.230707405611403
The training data (counts) for Avg_Amount_30_Days_Ratio: {0: 2679, 1: 18}
The test data (counts) for Avg_Amount_30_Days_Ratio: {0: 661, 1: 14} 

Best hyperparameters for Avg_Amount_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_60_Days_Ratio: 3.192610297265351
The training data (counts) for Avg_Amount_60_Days_Ratio: {0: 3501, 1: 35}
The test data (counts) for Avg_Amount_60_Days_Ratio: {0: 879, 1: 5} 

Best hyperparameters for Avg_Amount_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_90_Days_Ratio: 3.1087878209224846
The training data (counts) for Avg_Amount_90_Days_Ratio: {0: 4164, 1: 42}
The test data (counts) for Avg_Amount_90_Days_Ratio: {0: 1039, 1: 13} 

Best hyperparameters for Max_Amount_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Max_Amount_30_Days_Ratio: 3.19238761707751
The training data (counts) for Max_Amount_30_Days_Ratio: {0: 2674, 1: 23}
The test data (counts) for Max_Amount_30_Days_Ratio: {0: 658, 1: 17} 

Best hyperparameters for Max_Amount_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Max_Amount_60_Days_Ratio: 3.2619749812774237
The training data (counts) for Max_Amount_60_Days_Ratio: {0: 3506, 1: 30}
The test data (counts) for Max_Amount_60_Days_Ratio: {0: 869, 1: 15} 

Best hyperparameters for Max_Amount_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Max_Amount_90_Days_Ratio: 3.2202460330833116
The training data (counts) for Max_Amount_90_Days_Ratio: {0: 4174, 1: 32}
The test data (counts) for Max_Amount_90_Days_Ratio: {0: 1041, 1: 11} 

Best hyperparameters for Avg_Transaction_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_Ratio: 3.2269790186497085
The training data (counts) for Avg_Transaction_Ratio: {0: 6686, 1: 55}
The test data (counts) for Avg_Transaction_Ratio: {0: 1666, 1: 20} 

Best hyperparameters for Avg_Transaction_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_30_Days_Ratio: 3.14352593082513
The training data (counts) for Avg_Transaction_30_Days_Ratio: {0: 2674, 1: 23}
The test data (counts) for Avg_Transaction_30_Days_Ratio: {0: 652, 1: 23} 

Best hyperparameters for Avg_Transaction_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_60_Days_Ratio: 3.2052410942068272
The training data (counts) for Avg_Transaction_60_Days_Ratio: {0: 3508, 1: 28}
The test data (counts) for Avg_Transaction_60_Days_Ratio: {0: 860, 1: 24} 

Best hyperparameters for Avg_Transaction_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_90_Days_Ratio: 3.237319540474148
The training data (counts) for Avg_Transaction_90_Days_Ratio: {0: 4164, 1: 42}
The test data (counts) for Avg_Transaction_90_Days_Ratio: {0: 1031, 1: 21} 

Best hyperparameters for Avg_Amount_Vendor_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_Ratio: 2.4561038287135952
The training data (counts) for Avg_Amount_Vendor_Ratio: {0: 88965, 1: 568}
The test data (counts) for Avg_Amount_Vendor_Ratio: {0: 22242, 1: 142} 

Best hyperparameters for Avg_Amount_Vendor_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_30_Days_Ratio: 3.183548674471341
The training data (counts) for Avg_Amount_Vendor_30_Days_Ratio: {0: 11856, 1: 116}
The test data (counts) for Avg_Amount_Vendor_30_Days_Ratio: {0: 2970, 1: 23} 

Best hyperparameters for Avg_Amount_Vendor_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_60_Days_Ratio: 3.0824624891173382
The training data (counts) for Avg_Amount_Vendor_60_Days_Ratio: {0: 20694, 1: 205}
The test data (counts) for Avg_Amount_Vendor_60_Days_Ratio: {0: 5175, 1: 50} 

Best hyperparameters for Avg_Amount_Vendor_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_90_Days_Ratio: 3.13566568912757
The training data (counts) for Avg_Amount_Vendor_90_Days_Ratio: {0: 29708, 1: 246}
The test data (counts) for Avg_Amount_Vendor_90_Days_Ratio: {0: 7435, 1: 54} 

Descriptive Statistics for Avg_Amount_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Ratio  Anomaly_Score
0   Normal   6668  98.917075              0.81          -1.16
1  Outlier     73   1.082925             20.00           3.24 

Descriptive Statistics for Avg_Amount_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Ratio  Anomaly_Score
0   Normal   1673  99.228944              0.83          -1.12
1  Outlier     13   0.771056             13.54           3.26 

Descriptive Statistics for Avg_Amount_30_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_30_Days_Ratio  Anomaly_Score
0   Normal   2668  98.924731                      0.88          -0.43
1  Outlier     29   1.075269                     12.52           3.23 

Descriptive Statistics for Avg_Amount_30_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_30_Days_Ratio  Anomaly_Score
0   Normal    656  97.185185                      0.83          -0.43
1  Outlier     19   2.814815                      7.71           3.28 

Descriptive Statistics for Avg_Amount_60_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_60_Days_Ratio  Anomaly_Score
0   Normal   3489  98.670814                      0.83          -1.00
1  Outlier     47   1.329186                     17.90           3.24 

Descriptive Statistics for Avg_Amount_60_Days_Ratio (Test Data):
      Group  Count  Count %  Avg_Amount_60_Days_Ratio  Anomaly_Score
0   Normal    875  98.9819                      0.84          -0.98
1  Outlier      9   1.0181                     10.23           3.23 

Descriptive Statistics for Avg_Amount_90_Days_Ratio (Training Data):
      Group  Count   Count %  Avg_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   4143  98.50214                      0.81          -1.37
1  Outlier     63   1.49786                     16.67           3.18 

Descriptive Statistics for Avg_Amount_90_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   1033  98.193916                      0.80          -1.36
1  Outlier     19   1.806084                      8.56           3.19 

Descriptive Statistics for Max_Amount_30_Days_Ratio (Training Data):
      Group  Count    Count %  Max_Amount_30_Days_Ratio  Anomaly_Score
0   Normal   2657  98.516871                      2.16          -1.19
1  Outlier     40   1.483129                     53.95           3.19 

Descriptive Statistics for Max_Amount_30_Days_Ratio (Test Data):
      Group  Count    Count %  Max_Amount_30_Days_Ratio  Anomaly_Score
0   Normal    657  97.333333                      1.99          -1.21
1  Outlier     18   2.666667                     40.39           3.26 

Descriptive Statistics for Max_Amount_60_Days_Ratio (Training Data):
      Group  Count    Count %  Max_Amount_60_Days_Ratio  Anomaly_Score
0   Normal   3497  98.897059                      2.41          -1.32
1  Outlier     39   1.102941                     97.28           3.26 

Descriptive Statistics for Max_Amount_60_Days_Ratio (Test Data):
      Group  Count    Count %  Max_Amount_60_Days_Ratio  Anomaly_Score
0   Normal    869  98.303167                      2.09          -1.32
1  Outlier     15   1.696833                     68.52           3.28 

Descriptive Statistics for Max_Amount_90_Days_Ratio (Training Data):
      Group  Count    Count %  Max_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   4162  98.953875                      2.49          -1.55
1  Outlier     44   1.046125                    110.29           3.26 

Descriptive Statistics for Max_Amount_90_Days_Ratio (Test Data):
      Group  Count    Count %  Max_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   1040  98.859316                      2.48          -1.49
1  Outlier     12   1.140684                    103.34           3.28 

Descriptive Statistics for Avg_Transaction_Ratio (Training Data):
      Group  Count    Count %  Avg_Transaction_Ratio  Anomaly_Score
0   Normal   6662  98.828067                   0.55          -2.82
1  Outlier     79   1.171933                  41.50           3.26 

Descriptive Statistics for Avg_Transaction_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_Ratio  Anomaly_Score
0   Normal   1664  98.695136                   0.46          -2.89
1  Outlier     22   1.304864                  33.58           3.30 

Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Transaction_30_Days_Ratio  Anomaly_Score
0   Normal   2636  97.738228                           0.13          -2.37
1  Outlier     61   2.261772                           3.19           3.14 

Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_30_Days_Ratio  Anomaly_Score
0   Normal    647  95.851852                           0.10          -2.50
1  Outlier     28   4.148148                           2.58           3.25 

Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Training Data):
      Group  Count   Count %  Avg_Transaction_60_Days_Ratio  Anomaly_Score
0   Normal   3499  98.95362                           0.25          -2.29
1  Outlier     37   1.04638                           5.34           3.21 

Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_60_Days_Ratio  Anomaly_Score
0   Normal    855  96.719457                           0.15          -2.51
1  Outlier     29   3.280543                           5.63           3.28 

Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Transaction_90_Days_Ratio  Anomaly_Score
0   Normal   4150  98.668569                           0.23          -2.36
1  Outlier     56   1.331431                          11.82           3.24 

Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_90_Days_Ratio  Anomaly_Score
0   Normal   1031  98.003802                           0.23          -2.32
1  Outlier     21   1.996198                          11.44           3.29 

Descriptive Statistics for Avg_Amount_Vendor_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_Ratio  Anomaly_Score
0   Normal  85871  95.909888                     0.59          -1.25
1  Outlier   3662   4.090112                    10.90           2.58 

Descriptive Statistics for Avg_Amount_Vendor_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_Ratio  Anomaly_Score
0   Normal  21468  95.907791                     0.60          -1.25
1  Outlier    916   4.092209                     9.09           2.58 

Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_30_Days_Ratio  Anomaly_Score
0   Normal  11842  98.914133                             0.87          -0.86
1  Outlier    130   1.085867                            25.95           3.27 

Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_30_Days_Ratio  Anomaly_Score
0   Normal   2966  99.097895                             0.89          -0.81
1  Outlier     27   0.902105                            22.71           3.28 

Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_60_Days_Ratio  Anomaly_Score
0   Normal  20616  98.645868                             0.86          -0.95
1  Outlier    283   1.354132                            17.82           3.19 

Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_60_Days_Ratio  Anomaly_Score
0   Normal   5160  98.755981                             0.84          -0.96
1  Outlier     65   1.244019                            26.89           3.20 

Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_90_Days_Ratio  Anomaly_Score
0   Normal  29629  98.915003                             0.81          -0.74
1  Outlier    325   1.084997                            23.53           3.23 

Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_90_Days_Ratio  Anomaly_Score
0   Normal   7409  98.931767                             0.80          -0.75
1  Outlier     80   1.068233                            17.85           3.21 

Key Takeaways

Average Amount by Agency, Merchant:

The outliers from this category have significantly higher average amounts compared to the norm.

  • Overall: Training data (1.08% outliers), Test data (0.77% outliers)
  • 30 Days: Training data (1.08% outliers), Test data (2.81% outliers)
  • 60 Days: Training data (1.33% outliers), Test data (1.02% outliers)
  • 90 Days: Training data (1.50% outliers), Test data (1.81% outliers)

Transactions with high average amounts over the last 30 days are potential outliers. These may signal sudden changes in spending behavior, which could be indicative of special events or fraud. while higher averages across different time period indicated sustained increases in spending, it might be regular large transactions or signs of financial issues.

Max Amount by Agency, Merchant:

Maximum transaction amounts suggest instances of unusually large transactions which could be one-off large purchases or potential fraud.

  • 30 Days: Training data (1.48% outliers), Test data (2.67% outliers)
  • 60 Days: Training data (1.10% outliers), Test data (1.70% outliers)
  • 90 Days: Training data (1.05% outliers), Test data (1.14% outliers)

Transactions over different time period can help in understanding major financial activitives or fraudulent transactions that need further investigation.

Average Transaction by Agency, Merchant:

Higher average transaction ratios suggest frequent high-value transactions, which might be normal for some businesses but could indicate unusual activity in others.

  • Overall: Training data (1.17% outliers), Test data (1.30% outliers)
  • 30 Days: Training data (2.26% outliers), Test data (4.15% outliers)
  • 60 Days: Training data (1.05% outliers), Test data (3.28% outliers)
  • 90 Days: Training data (1.33% outliers), Test data (2.00% outliers)

Outliers in average transactions over 30 days might reflect a sudden spike in transaction frequency or amount, requiring investigation into potential reasons such as promotions or fraud. For longer period, outliers could indicate changing business patterns either business growth or unsual activity.

Average Amoung by Agency, Vendor:

Similarly to Merchant, this category add granularity into the vendor level

  • Overall: Training data (4.09% outliers), Test data (4.09% outliers)
  • 30 Days: Training data (1.09% outliers), Test data (0.90% outliers)
  • 60 Days: Training data (1.35% outliers), Test data (1.24% outliers)
  • 90 Days: Training data (1.08% outliers), Test data (1.07% outliers)

Vendors with high average amounts over the last 30 days could be part of specific promotions or might indicate potential issues like overbilling or fraud.

Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD) ¶

Approach:

  • ECOD uses the empirical cumulative distribution function (ECDF) to model the data. It considers the rank of each data point in the cumulative distribution.

Scoring:

  • The anomaly score is based on the empirical distribution of the data. Points far from the cumulative distribution curve are considered more anomalous.

Advantages:

  • Non-parametric and can handle non-linear relationships and interactions between features.

Limitations:

  • Can be computationally intensive for large datasets. Sensitive to the empirical distribution.
In [19]:
# Define a scorer for GridSearchCV
def ecod_scorer(y_true, y_pred):
    return f1_score(y_true, y_pred)
    
# Apply ECOD to each ratio DataFrame and calculate descriptive statistics
for key, df in ratios.items():
    ratio_col = [col for col in df.columns if col.endswith('Ratio')][0]

    # Prepare the data for ECOD
    X = df[[ratio_col]].values

    # Split data into train and test sets
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

    # Define the parameter grid
    param_grid = {
        'contamination': [0.01, 0.03, 0.05, 0.1, 0.2]
    }
    
    # Initialize ECOD model
    ecod = ECOD()

    # Initialize GridSearchCV
    grid_search = GridSearchCV(estimator=ecod, param_grid=param_grid, scoring=make_scorer(ecod_scorer), cv=3)

    # Fit ECOD model on training data
    grid_search.fit(X_train)

    # Get the best model
    best_ecod = grid_search.best_estimator_

    # Get anomaly scores and predictions
    y_train_scores = best_ecod.decision_function(X_train)
    y_train_pred = best_ecod.predict(X_train)
    y_test_scores = best_ecod.decision_function(X_test)
    y_test_pred = best_ecod.predict(X_test)

    # Print statistics
    print(f"Best contamination: {grid_search.best_params_['contamination']}")
    print(f"Training data (counts): {count_stat(y_train_pred)}")
    print(f"Test data (counts): {count_stat(y_test_pred)}")
    print(f"Threshold: {best_ecod.threshold_}")

    # Plot histogram of ECOD scores using Plotly
    fig = make_subplots(rows=1, cols=2, subplot_titles=('Training Data', 'Test Data'))

    fig.add_trace(
        go.Histogram(x=y_train_scores, name='Training'),
        row=1, col=1
    )

    fig.add_trace(
        go.Histogram(x=y_test_scores, name='Test'),
        row=1, col=2
    )

    fig.update_layout(
        title=f"ECOD scores for {ratio_col}",
        xaxis_title="ECOD Score",
        yaxis_title="Frequency",
        showlegend=True,
        template='seaborn',
        height=450,
        width=1000,
        paper_bgcolor='rgba(0,0,0,0)'
    )

    fig.show()

    # Get descriptive statistics for outliers
    stats_train = descriptive_stat_threshold(pd.DataFrame(X_train, columns=[ratio_col]), y_train_scores, best_ecod.threshold_)
    stats_test = descriptive_stat_threshold(pd.DataFrame(X_test, columns=[ratio_col]), y_test_scores, best_ecod.threshold_)

    # Store the statistics
    descriptive_stats[ratio_col] = {'train': stats_train, 'test': stats_test}

# Print descriptive statistics for each ratio
for ratio_col, stats in descriptive_stats.items():
    print(f"Descriptive Statistics for {ratio_col} (Training Data):\n", stats['train'], "\n")
    print(f"Descriptive Statistics for {ratio_col} (Test Data):\n", stats['test'], "\n")
Best contamination: 0.01
Training data (counts): {0: 6673, 1: 68}
Test data (counts): {0: 1672, 1: 14}
Threshold: 4.978719682275966
Best contamination: 0.01
Training data (counts): {0: 2671, 1: 26}
Test data (counts): {0: 670, 1: 5}
Threshold: 5.260837993524471
Best contamination: 0.01
Training data (counts): {0: 3500, 1: 36}
Test data (counts): {0: 878, 1: 6}
Threshold: 5.2614561384167775
Best contamination: 0.01
Training data (counts): {0: 4164, 1: 42}
Test data (counts): {0: 1039, 1: 13}
Threshold: 5.253224902904329
Best contamination: 0.01
Training data (counts): {0: 2671, 1: 26}
Test data (counts): {0: 670, 1: 5}
Threshold: 5.260837993524471
Best contamination: 0.01
Training data (counts): {0: 3500, 1: 36}
Test data (counts): {0: 874, 1: 10}
Threshold: 5.2614561384167775
Best contamination: 0.01
Training data (counts): {0: 4164, 1: 42}
Test data (counts): {0: 1040, 1: 12}
Threshold: 5.253224902904329
Best contamination: 0.01
Training data (counts): {0: 6673, 1: 68}
Test data (counts): {0: 1670, 1: 16}
Threshold: 4.5906163359088765
Best contamination: 0.01
Training data (counts): {0: 2670, 1: 27}
Test data (counts): {0: 668, 1: 7}
Threshold: 4.569145518731359
Best contamination: 0.01
Training data (counts): {0: 3500, 1: 36}
Test data (counts): {0: 874, 1: 10}
Threshold: 4.577642844335586
Best contamination: 0.01
Training data (counts): {0: 4163, 1: 43}
Test data (counts): {0: 1040, 1: 12}
Threshold: 4.5819177646578435
Best contamination: 0.01
Training data (counts): {0: 88637, 1: 896}
Test data (counts): {0: 22163, 1: 221}
Threshold: 5.296855829268316
Best contamination: 0.01
Training data (counts): {0: 11852, 1: 120}
Test data (counts): {0: 2965, 1: 28}
Threshold: 5.284245502365114
Best contamination: 0.01
Training data (counts): {0: 20691, 1: 208}
Test data (counts): {0: 5174, 1: 51}
Threshold: 5.293496239760728
Best contamination: 0.01
Training data (counts): {0: 29654, 1: 300}
Test data (counts): {0: 7423, 1: 66}
Threshold: 5.293261248815198
Descriptive Statistics for Avg_Amount_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Ratio  Anomaly_Score
0   Normal   6673  98.991248              0.83           1.64
1  Outlier     68   1.008752             19.53           6.15 

Descriptive Statistics for Avg_Amount_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Ratio  Anomaly_Score
0   Normal   1672  99.169632              0.85           1.63
1  Outlier     14   0.830368             11.33           6.01 

Descriptive Statistics for Avg_Amount_30_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_30_Days_Ratio  Anomaly_Score
0   Normal   2669  98.961809                      0.88           1.64
1  Outlier     28   1.038191                     12.68           6.10 

Descriptive Statistics for Avg_Amount_30_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_30_Days_Ratio  Anomaly_Score
0   Normal    670  99.259259                      0.95           1.66
1  Outlier      5   0.740741                     10.97           6.67 

Descriptive Statistics for Avg_Amount_60_Days_Ratio (Training Data):
      Group  Count  Count %  Avg_Amount_60_Days_Ratio  Anomaly_Score
0   Normal   3500  98.9819                      0.89           1.64
1  Outlier     36   1.0181                     17.12           6.15 

Descriptive Statistics for Avg_Amount_60_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_60_Days_Ratio  Anomaly_Score
0   Normal    878  99.321267                      0.90           1.64
1  Outlier      6   0.678733                      6.03           5.96 

Descriptive Statistics for Avg_Amount_90_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   4162  98.953875                      0.87           1.64
1  Outlier     44   1.046125                     17.79           6.14 

Descriptive Statistics for Avg_Amount_90_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   1039  98.764259                      0.86           1.63
1  Outlier     13   1.235741                      7.05           5.99 

Descriptive Statistics for Max_Amount_30_Days_Ratio (Training Data):
      Group  Count    Count %  Max_Amount_30_Days_Ratio  Anomaly_Score
0   Normal   2669  98.961809                      2.33           1.64
1  Outlier     28   1.038191                     60.12           6.10 

Descriptive Statistics for Max_Amount_30_Days_Ratio (Test Data):
      Group  Count    Count %  Max_Amount_30_Days_Ratio  Anomaly_Score
0   Normal    670  99.259259                      2.44           1.65
1  Outlier      5   0.740741                     80.52           6.32 

Descriptive Statistics for Max_Amount_60_Days_Ratio (Training Data):
      Group  Count  Count %  Max_Amount_60_Days_Ratio  Anomaly_Score
0   Normal   3500  98.9819                      2.67           1.64
1  Outlier     36   1.0181                     79.80           6.15 

Descriptive Statistics for Max_Amount_60_Days_Ratio (Test Data):
      Group  Count    Count %  Max_Amount_60_Days_Ratio  Anomaly_Score
0   Normal    874  98.868778                      2.50           1.61
1  Outlier     10   1.131222                     66.04           5.97 

Descriptive Statistics for Max_Amount_90_Days_Ratio (Training Data):
      Group  Count    Count %  Max_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   4162  98.953875                      2.77           1.64
1  Outlier     44   1.046125                     83.89           6.14 

Descriptive Statistics for Max_Amount_90_Days_Ratio (Test Data):
      Group  Count    Count %  Max_Amount_90_Days_Ratio  Anomaly_Score
0   Normal   1040  98.859316                      2.88           1.65
1  Outlier     12   1.140684                     69.01           6.00 

Descriptive Statistics for Avg_Transaction_Ratio (Training Data):
      Group  Count    Count %  Avg_Transaction_Ratio  Anomaly_Score
0   Normal   6673  98.991248                   0.56           1.38
1  Outlier     68   1.008752                  46.73           5.55 

Descriptive Statistics for Avg_Transaction_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_Ratio  Anomaly_Score
0   Normal   1670  99.051008                    0.5           1.36
1  Outlier     16   0.948992                   42.0           5.47 

Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Transaction_30_Days_Ratio  Anomaly_Score
0   Normal   2670  98.998888                           0.14           1.21
1  Outlier     27   1.001112                           6.10           5.51 

Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_30_Days_Ratio  Anomaly_Score
0   Normal    668  98.962963                           0.15           1.22
1  Outlier      7   1.037037                           5.26           5.29 

Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Training Data):
      Group  Count  Count %  Avg_Transaction_60_Days_Ratio  Anomaly_Score
0   Normal   3500  98.9819                           0.20           1.28
1  Outlier     36   1.0181                          10.31           5.51 

Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_60_Days_Ratio  Anomaly_Score
0   Normal    874  98.868778                           0.22           1.26
1  Outlier     10   1.131222                           9.91           5.42 

Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Transaction_90_Days_Ratio  Anomaly_Score
0   Normal   4163  98.977651                           0.25           1.31
1  Outlier     43   1.022349                          14.08           5.52 

Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Transaction_90_Days_Ratio  Anomaly_Score
0   Normal   1040  98.859316                           0.27           1.33
1  Outlier     12   1.140684                          16.21           5.74 

Descriptive Statistics for Avg_Amount_Vendor_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_Ratio  Anomaly_Score
0   Normal  88637  98.999252                     0.79           1.62
1  Outlier    896   1.000748                    23.13           6.29 

Descriptive Statistics for Avg_Amount_Vendor_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_Ratio  Anomaly_Score
0   Normal  22163  99.012688                     0.79           1.62
1  Outlier    221   0.987312                    16.95           6.35 

Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_30_Days_Ratio  Anomaly_Score
0   Normal  11852  98.997661                             0.93           1.64
1  Outlier    120   1.002339                            22.62           6.24 

Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_30_Days_Ratio  Anomaly_Score
0   Normal   2965  99.064484                             0.93           1.64
1  Outlier     28   0.935516                            17.28           6.39 

Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_60_Days_Ratio  Anomaly_Score
0   Normal  20689  98.995167                             0.89           1.63
1  Outlier    210   1.004833                            20.21           6.26 

Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_60_Days_Ratio  Anomaly_Score
0   Normal   5174  99.023923                             0.87           1.62
1  Outlier     51   0.976077                            30.93           6.26 

Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Training Data):
      Group  Count    Count %  Avg_Amount_Vendor_90_Days_Ratio  Anomaly_Score
0   Normal  29654  98.998464                             0.87           1.63
1  Outlier    300   1.001536                            19.50           6.27 

Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Test Data):
      Group  Count    Count %  Avg_Amount_Vendor_90_Days_Ratio  Anomaly_Score
0   Normal   7423  99.118707                             0.87           1.63
1  Outlier     66   0.881293                            14.33           6.18 

Key Takeaways

Average Amount by Agency, Merchant:

  • Overall: Training data (1.01% outliers), Test data (0.83% outliers)
  • 30 Days: Training data (1.04% outliers), Test data (0.74% outliers)
  • 60 Days: Training data (1.02% outliers), Test data (0.68% outliers)
  • 90 Days: Training data (1.05% outliers), Test data (1.24% outliers)

Max Amount by Agency, Merchant:

  • 30 Days: Training data (1.04% outliers), Test data (0.74% outliers)
  • 60 Days: Training data (1.02% outliers), Test data (1.13% outliers)
  • 90 Days: Training data (1.05% outliers), Test data (1.14% outliers)

Average Transaction by Agency, Merchant:

  • Overall: Training data (1.01% outliers), Test data (0.95% outliers)
  • 30 Days: Training data (1.00% outliers), Test data (1.04% outliers)
  • 60 Days: Training data (1.02% outliers), Test data (1.13% outliers)
  • 90 Days: Training data (1.02% outliers), Test data (1.14% outliers)

Average Amoung by Agency, Vendor:

  • Overall: Training data (1.00% outliers), Test data (0.99% outliers)
  • 30 Days: Training data (1.00% outliers), Test data (0.94% outliers)
  • 60 Days: Training data (1.00% outliers), Test data (0.98% outliers)
  • 90 Days: Training data (1.00% outliers), Test data (0.88% outliers)

Section 6: Models Predictions Comparison¶

In [20]:
# HBOS
hbos = HBOS(n_bins=600, contamination=0.01)
hbos.fit(X_train)
y_train_hbos_pred = hbos.labels_
y_test_hbos_pred = hbos.predict(X_test)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)

# ECOD
ecod = ECOD(contamination=0.01)
ecod.fit(X_train)
y_train_ecod_pred = ecod.labels_
y_test_ecod_pred = ecod.predict(X_test)
y_train_ecod_scores = ecod.decision_scores_  # raw outlier scores
y_test_ecod_scores = ecod.decision_function(X_test)

# Thresholds
thresholds = [ecod.threshold_, hbos.threshold_]
print("Thresholds for ECOD and HBOS:", thresholds, '\n')

# Compare HBOS and ECOD predictions
comparison_df = pd.DataFrame({'HBOS_pred': y_test_hbos_pred, 'ECOD_pred': y_test_ecod_pred})

print(pd.crosstab(comparison_df['HBOS_pred'], comparison_df['ECOD_pred']))
Thresholds for ECOD and HBOS: [5.293261248815198, 3.2885504005909816] 

ECOD_pred     0   1
HBOS_pred          
0          7405  30
1            18  36
In [21]:
# Compute confusion matrix
cm = pd.crosstab(y_test_hbos_pred, y_test_ecod_pred)

# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm,
                    x=['Predicted 0', 'Predicted 1'],
                    y=['True 0', 'True 1'],
                    showscale=False,
                    colorscale=[
                        [0.0, "#e3e7e6"],  
                        [1.0, "#eac06c"]   
                    ],
                    xgap=20,
                    ygap=20,
                    text=cm,
                    texttemplate="%{text}")

# Define layout
layout = go.Layout(
    title=dict(text="Confusion Matrix between HBOS and ECOD", x=0.5, y=0.9, xanchor='center', yanchor='top'),
    xaxis=dict(title='Predicted label', showticklabels=True),
    yaxis=dict(title='True label', showticklabels=True),
    autosize=False,
    width=500,
    height=500
)

# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()

Key Takeways from Crosstab:

Agree on Normal:

  • Both HBOS and ECOD predicted 7405 instances as normal (0)

Agree on Outliers:

  • Both HBOS and ECOD predicted 36 instances as outliers (1)

Discrepancy in Predictions:

  • There are 30 instances where HBOS predicted normal (0), but ECOD predicted outliers (1)
  • There are 18 instances where HBOS predicted outliers (1), but ECOD predicted normal (0). This indicates that HBOS is more sensitive, potentially identifying more outliers but including some that ECOD considers normal.

The high count (7405) of instances predicted as normal by both models indicates a strong agreement on what constitutes normal behavior.

Both models agreed on 36 instances being outliers. This agreement suggests that these instances are likely true outliers, as both models identified them as such despite their different methodologies.

ECOD is stricter with a higher threshold, leading to fewer false positives and potentially misses some outliers that HBOS catches (higher false negatives). On the other hand, HBOS is more sensitive with a lower threshold, leading to more outliers detected, including some instances that ECOD considers normal (higher false postives)

Section 7: Conclusion¶

Detection Method Comparison:

HBOS tends to identify fewer outliers with scores more tightly clustered around the center. ECOD detects a broader range of outliers, showing more granularity in the outlier scores.

Outliers Locations:

Outliers are generally detected with higher scores (above 2 for HBOS and above 5 for ECOD). These outliers are found at the tails of the distribution in both training and test datasets.

In the context of credit card transaction anomaly detection, both models have their strengths and weaknesses. ECOD is more aggressive in flagging anomalies, which might be better for catching more fraudulent transactions at the cost of higher false positives. HBOS, being more conservative, might result in fewer disruptions to legitimate transactions but may miss some anomalies. Depending on the specific goals and risk tolerance of the anomaly detection system, either model, or a combination of both, could be employed to enhance the overall detection capability.